Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extra way to check OOM of a command ran inside Docker #3294

Merged
merged 2 commits into from
Nov 22, 2017

Conversation

humitos
Copy link
Member

@humitos humitos commented Nov 21, 2017

Easy way to reproduce this error:

  1. Import a project that compiles lxml (Read the Docs, for example)
  2. The pip command will be killed and Docker won't return a specific
    status code, it will just return 1
  3. In these cases, the Killed word will probably appear in the last
    15 lines

If I increase the DOCKER_LIMITS memory in my local instance I don't get this error.

Easy way to reproduce this error:

1. Import a project that compiles lxml (Read the Docs, for example)
2. The `pip` command will be killed and Docker won't return a specific
status code, it will just return 1
3. In these cases, the `Killed` word will probably appear in the last
15 lines
@humitos
Copy link
Member Author

humitos commented Nov 21, 2017

I think Docker is not reporting a proper OOM exit status code because the command executed was pip, but the ones that it's failing is x86_64-linux-gnu-gcc.

In fact, I can see in the output:

error: command 'x86_64-linux-gnu-gcc' failed with exit status 4

# not use the specific exit code, so we check if the word `Killed`
# is in the last 15 lines of the command's output
killed_in_output = 'Killed' in '\n'.join(self.output.splitlines()[-15:])
if self.exit_code == DOCKER_OOM_EXIT_CODE or killed_in_output:
self.output = _('Command killed due to excessive memory '
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like to override the output of the command here, I think that it should be good to have a specific exception for this and show the error similar than other errors (like timeout, for example)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe 'Killed' is normally the last line of output. Was this not the case for your build? We might be able to severely reduce the number of false positives here by limiting this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we know if this string changes if you run Sphinx with another locale?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also also, perhaps logic should be:

if self.exit_code == DOCKER_OOM_EXIT_CODE or self.exit_code == -1 and killed_in_output

I'm not sure if -1 is correct here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe 'Killed' is normally the last line of output. Was this not the case for your build?

No.

My last lines of the output are:

    building 'lxml.etree' extension
    creating build/temp.linux-x86_64-2.7
    creating build/temp.linux-x86_64-2.7/src
    creating build/temp.linux-x86_64-2.7/src/lxml
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/include/libxml2 -I/tmp/pip-build-_MvJYS/lxml/src/lxml/includes -I/usr/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.linux-x86_64-2.7/src/lxml/lxml.etree.o -w
    x86_64-linux-gnu-gcc: internal compiler error: Killed (program cc1)
    Please submit a full bug report,
    with preprocessed source if appropriate.
    See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 4
    
    ----------------------------------------
Command "/home/humitos/rtfd/code/readthedocs.org/user_builds/rtd/envs/latest/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-_MvJYS/lxml/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-3RRzJB-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/humitos/rtfd/code/readthedocs.org/user_builds/rtd/envs/latest/include/site/python2.7/lxml" failed with error code 1 in /tmp/pip-build-_MvJYS/lxml/

Also, do we know if this string changes if you run Sphinx with another locale?

This is a good question. But I think the locale would be the one used by pip, since it's a command ran by pip but at the end it's x86_64-linux-gnu-gcc what's ran and if the LANG is not exported I would say that it's the system's default one used.

I'm not sure if -1 is correct here

Nop. My exit_code was 1 for this particular case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, pip rather. Not sure if there is a good way to get errors out of pip.

We should check for exit_code == 1 in this logic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add that.

Do you think that it's better to append our error message to the self.output instead of replacing it?

So, in my case, instead of just Command killed due to excessive memory consumption it would have some information where the user can see that it gets killed because of lxml and maybe it's something that's not needed to build the docs, so the user can mock it.

Copy link
Contributor

@agjohnson agjohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've had to do similar hacks for PDF output in the past, but I agree, it's probably necessary.

I didn't think about why this isn't being caught, that's an interesting thought though -- that it's not bubbling up to Docker because its a subprocess. There might be something kernel level we can do here to make this more consistent.

# not use the specific exit code, so we check if the word `Killed`
# is in the last 15 lines of the command's output
killed_in_output = 'Killed' in '\n'.join(self.output.splitlines()[-15:])
if self.exit_code == DOCKER_OOM_EXIT_CODE or killed_in_output:
self.output = _('Command killed due to excessive memory '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe 'Killed' is normally the last line of output. Was this not the case for your build? We might be able to severely reduce the number of false positives here by limiting this.

# not use the specific exit code, so we check if the word `Killed`
# is in the last 15 lines of the command's output
killed_in_output = 'Killed' in '\n'.join(self.output.splitlines()[-15:])
if self.exit_code == DOCKER_OOM_EXIT_CODE or killed_in_output:
self.output = _('Command killed due to excessive memory '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we know if this string changes if you run Sphinx with another locale?

# not use the specific exit code, so we check if the word `Killed`
# is in the last 15 lines of the command's output
killed_in_output = 'Killed' in '\n'.join(self.output.splitlines()[-15:])
if self.exit_code == DOCKER_OOM_EXIT_CODE or killed_in_output:
self.output = _('Command killed due to excessive memory '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also also, perhaps logic should be:

if self.exit_code == DOCKER_OOM_EXIT_CODE or self.exit_code == -1 and killed_in_output

I'm not sure if -1 is correct here

@humitos
Copy link
Member Author

humitos commented Nov 21, 2017

Another example at .org : https://readthedocs.org/projects/jsk-docs/builds/6301911/

It's clear that exit_code wasn't DOCKER_OOM_EXIT_CODE because the last line is Killed\n, so it should have matched and the self.output should have been rewritten.

@agjohnson
Copy link
Contributor

It's clear that exit_code wasn't DOCKER_OOM_EXIT_CODE because the last line is Killed\n, so it should have matched and the self.output should have been rewritten.

Also, the exit code of 137 is special, it means the process got a signal of SIGKILL, and usually means an OOM condition in our case -- we won't kill the process otherwise.

@agjohnson agjohnson added this to the Build stability milestone Nov 21, 2017
@agjohnson agjohnson merged commit d6eac0e into master Nov 22, 2017
@agjohnson agjohnson deleted the humitos/docker/oom-handling branch November 22, 2017 00:01
@humitos
Copy link
Member Author

humitos commented Nov 22, 2017

Yesterday I took a look at the documentation of pip and research a little about how this subprocess call are communicated with pip and what happen when someone else kills the inner process (compiler). I didn't find anything useful, maybe I should jump into the pip's source code but I don't think it worth it at this moment.

I think this it's kind of enought for now.

kakulukia added a commit to kakulukia/readthedocs.org that referenced this pull request Nov 23, 2017
* 'master' of github.com:rtfd/readthedocs.org:
  Add GLOBAL_PIP_CACHE setting (readthedocs#3299)
  Remove invalid attribute `exception` (readthedocs#3298)
  Update build from database at __exit__ (readthedocs#3292)
  Extra way to check OOM of a command ran inside Docker (readthedocs#3294)
  Add supervisord contrib example (readthedocs#3262)
  Allow promos to be placed in the docs footer (readthedocs#3267)
  Do not convert to bytes the `refresh_token` (readthedocs#3273)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants